University of Tehran, School of ECE

Data Analytics Course

Fall 1400



Hands-on 2

Send your questions to Yara Mohamadi :D

$ ( click to jump on task )
.
├── Introduction
│   └── Jupyter hack!!
│
├── Working with Beautiful Soup
│   └── Searching with Beautiful Soup
│ 
├── Task1: Football Table
│
├── Task2: Digikala Laptop Search
│   └── cleaning the table
│
├── Working with Selenium
│   └── Selenium Webdriver basics
│       ├─ Another example
│       └─ Waits
│
└── Task3: Extracting ticket information
    ├── Crawling Mrbilit
    └── Cleaning & Joining the Tables



Introduction

In this Hands-On exercise, you will become familiar with web scraping using Beautiful Soup and Selenium, and with cleaning and joining the tables you extract.


Some of the Web Scraping examples are inspired by this awesome free Persian course by Mr. Hossein Khorang. Please check it out for more insight.


Jupyter hack!!

Run the code below. Now, when you press TAB while writing code, you get a list of all available functions and objects and can enjoy auto-completion. I recommend going wild with this feature and using it all the time! You can also press SHIFT + TAB with the cursor on any function or variable to see its documentation.


Working with Beautiful Soup

We can send a GET request to any webpage and get back its front-end source code. Raw source code is usually messy and difficult to parse...
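As a minimal sketch of such a GET request, assuming the `requests` library is installed: the helper name `fetch_html` and the timeout value are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Send a GET request to `url` and return the page source as text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx answers instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Example usage (needs network access):
# html = fetch_html("https://www.python.org")
# print(html[:300])  # the first 300 characters of the raw source
```

Printing only the first few hundred characters is usually enough to see how dense the raw source is.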


All you need is a beautiful soup!

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Please install it in your conda environment:
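The package is published as `beautifulsoup4`, so `conda install beautifulsoup4` or `pip install beautifulsoup4` should work in most environments. Once it is installed, parsing and pretty-printing can be sketched as follows. The one-line HTML string is a hypothetical stand-in for a fetched page, and Python's built-in `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical raw source: everything crammed onto a single line.
raw_html = (
    '<ul><li><a href="/about/">About</a></li>'
    '<li><a href="/community/">Community</a></li></ul>'
)

# Parse the string into a tree of tags.
soup = BeautifulSoup(raw_html, "html.parser")

# prettify() returns the same markup with one tag per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```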

Much prettier, eh?

These HTML tags are exactly what you see when you press F12 on a webpage. More specifically, when you right-click an element on a webpage and choose Inspect, you can see which tag it belongs to. Try it for yourself! On www.python.org, inspecting Community should look like this:

[Screenshot: browser inspector with the Community link highlighted]

We can see which tag this element belongs to (<a> inside a <li> tag). We can also see its attributes (the link (href) it goes to) and its text value ('Community').


Searching with Beautiful Soup

Beautiful soup allows you to search through the source code by tag names and their attributes. The code below finds the first <a> tag which satisfies the given conditions.
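A minimal sketch of `find`, assuming a small hypothetical menu instead of a live page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a navigation menu.
html = """
<ul class="menu">
  <li><a href="/about/" title="About">About</a></li>
  <li><a href="/community/" title="Community">Community</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# With only a tag name, find() returns the first matching tag in the document.
first_link = soup.find("a")

# Attribute conditions narrow the search; find() returns None if nothing matches.
community = soup.find("a", attrs={"href": "/community/"})

print(first_link)
print(community)
```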


What if we need to find all elements that satisfy a condition?
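One way to do this is `find_all`, which returns a list of every match. The markup and the `nav` class below are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links, two of which carry class="nav".
html = """
<ul>
  <li><a class="nav" href="/about/">About</a></li>
  <li><a class="nav" href="/community/">Community</a></li>
  <li><a href="/jobs/">Jobs</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <a> tag in the document, in document order.
all_links = soup.find_all("a")

# Only the <a> tags with class="nav" (class_ avoids the Python keyword).
nav_links = soup.find_all("a", class_="nav")

print(len(all_links), len(nav_links))
```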


What if we want to access their attributes?
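A tag behaves much like a dictionary of its attributes. The sketch below assumes hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a multi-valued class attribute.
html = (
    '<a href="/about/" class="nav main">About</a>'
    '<a href="/community/">Community</a>'
)
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# Square brackets read one attribute and raise KeyError if it is missing.
href = link["href"]

# get() returns None instead of raising when the attribute is absent.
title = link.get("title")

# attrs holds all attributes; class is split into a list of names.
attrs = link.attrs

# Collect one attribute from every match.
hrefs = [a["href"] for a in soup.find_all("a")]

print(href, title, attrs, hrefs)
```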


You can use select_one and select in a similar fashion to find and findAll, but these functions are more powerful: they let you define complex conditions using CSS selector syntax!

For example, the first line below finds the first element with id='nojs' that is inside another element. The second line finds all elements with class='do-not-print' that are inside another element.
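A minimal sketch of both selectors follows. The tag names were lost from the text above, so the use of `<div>` here, like the markup itself, is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the id and class names come from the text above.
html = """
<div id="wrapper">
  <div id="nojs" class="do-not-print"><p>Notice</p></div>
  <div class="do-not-print">Banner</div>
</div>
<div class="do-not-print">Not nested in another div</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div div#nojs": a <div> with id="nojs" that is a descendant of another <div>.
nojs = soup.select_one("div div#nojs")

# "div div.do-not-print": every <div> with that class nested inside a <div>.
hidden = soup.select("div div.do-not-print")

print(nojs)
print(len(hidden))
```

The last `<div>` is not matched by the second selector because it has no enclosing `<div>`.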

Check out the documentation for more awesome tricks!


What if we want to access the text?
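A tag's text is available through `.text` or `get_text()`. The markup below is a hypothetical example with extra whitespace, to show the effect of `strip=True`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with padding around the link text.
html = '<li><a href="/community/">  Community  </a></li>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# .text returns the text exactly as it appears in the source.
raw_text = link.text

# get_text(strip=True) removes leading and trailing whitespace.
clean_text = link.get_text(strip=True)

print(repr(raw_text))
print(repr(clean_text))
```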


Task1: Football table

The goal of this exercise is to make you more familiar with inspecting HTML source code by extracting information from a table.
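As a rough sketch of the general idea, rows and cells can be walked with `find_all`. The table below, including its `id`, column names, and values, is entirely hypothetical and is not the table used in the task.

```python
from bs4 import BeautifulSoup

# Hypothetical standings table; the real page will have different markup.
html = """
<table id="standings">
  <tr><th>Rank</th><th>Team</th><th>Points</th></tr>
  <tr><td>1</td><td>Team A</td><td>10</td></tr>
  <tr><td>2</td><td>Team B</td><td>7</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="standings")

rows = []
for tr in table.find_all("tr"):
    # Header cells are <th> and data cells are <td>; collect both kinds.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, body = rows[0], rows[1:]
print(header)
print(body)
```

On a real page, inspect the table first to find which tag, id, or class identifies it.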

When working with Persian letters, sometimes requests can get the encoding wrong and show strange characters. If this happens, restart the kernel and run the code again.
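Another option that sometimes helps, offered as an assumption rather than as part of the original instructions, is to set the response encoding explicitly before reading `.text`. When a server sends a text content type without a charset, `requests` falls back to ISO-8859-1, which garbles UTF-8 Persian text. The helper name `fetch_text_utf8` and the sample string are hypothetical.

```python
import requests


def fetch_text_utf8(url, timeout=10):
    """Fetch a page and decode it as UTF-8 (assumes the page really is UTF-8)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    response.encoding = "utf-8"  # override whatever requests guessed
    return response.text


# Offline illustration of the problem, using a hypothetical Persian string.
persian = "جدول لیگ"
raw_bytes = persian.encode("utf-8")

garbled = raw_bytes.decode("iso-8859-1")  # wrong codec: strange characters
fixed = raw_bytes.decode("utf-8")         # right codec: the original text

print(garbled)
print(fixed)
```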